Counting and generating lambda terms

Katarzyna Grygiel *
Theoretical Computer Science Department,
Faculty of Mathematics and Computer Science,
Jagiellonian University,
ul. Prof. Łojasiewicza 6, 30-348 Kraków, Poland
email: grygiel@tcs.uj.edu.pl

Pierre Lescanne
ENS de Lyon,
LIP (UMR 5668 CNRS ENS Lyon UCBL INRIA)
University of Lyon,
46 allée d'Italie, 69364 Lyon, France
email: pierre.lescanne@ens-lyon.fr

December 18, 2012

Abstract

Lambda calculus is the basis of functional programming and higher order proof assistants. However, little is known about combinatorial properties of lambda terms, in particular, about their asymptotic distribution and random generation. Among others, this paper tries to answer questions like: How many terms of a given size are there? What is a "typical" structure of a simply typed term? Despite their ostensible simplicity, these questions still remain unanswered, whereas solutions to such problems are essential for testing compilers and optimizing programs whose expected efficiency depends on the size of terms. Our approach toward the aforementioned problems may later be extended to any language with bound variables, i.e., with scopes and declarations.

This paper presents two complementary approaches: one, theoretical, uses complex analysis and generating functions; the other, experimental, is based on computer algebra software able to handle huge numbers efficiently. Thanks to de Bruijn indices, we provide formulas for the number of closed lambda terms of a given size and show their relevance to recursively defined integer polynomials. Knowledge of the asymptotic behavior of the polynomial coefficients suggests an approach to the problem of the asymptotic behavior of the numbers of closed lambda terms. Indeed, this problem is not amenable to standard generating function methods due to the unusual form of the recurrences. As a by-product of the counting formulas, we design an algorithm for generating lambda terms. The tests we performed provide us with experimental data, like the average depth of bound variables and the average number of head lambdas. We also create random generators for various sorts of terms. Thereafter, we conduct experiments that answer questions like: What is the ratio of simply typed terms among all terms? (Very small!) How are simply typed lambda terms distributed among all lambda terms? (A typed term almost always starts with an abstraction.)

In this paper, variables have size 0. 

Keywords: Lambda calculus, combinatorics, functional programming, test, random generator, Catalan numbers

1 Introduction 

Let us start with a few questions relevant to the problems we address. 

• How many closed λ-terms are of size 50 (up to α-conversion)?

996657783344523283417055002040148075226700996391558695269946852267.

• How many terms of size n are there?

We will give a recursive formula for this number in Section 2.

• What is enumerated by the sequence

0, 1, 3, 14, 82, 579, 4741, 43977, 454283, 5159441, 63782411?

This sequence enumerates closed terms of size 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. We will give three ways to compute it (Section 4).

• Is it possible to generate simply typed terms randomly?

Yes, by generating random lambda terms with uniform probability and sieving those that are simply typed. Thus, we can generate simply typed terms uniformly at random up to size 50 and, with some bias, simply typed terms of size 200.

• Is a term starting with an abstraction more likely to be typable than a term starting with an application?

The answer is positive, as shown in Figure 9, which gives the distribution of simply typed lambda terms among all lambda terms.

• Do these results have practical consequences?

Yes, they enable efficient random generation of simply typed terms of size up to 30 (random) and up to 100 (biased) in order to debug compilers or other programs manipulating terms, e.g., type checkers or pretty printers.
The above questions seem rather classical, but amazingly very little is known about combinatorial aspects of lambda terms. However, the answers to these questions are extremely important not only for a better understanding of the structure of lambda terms, but also for people who build test samples for debugging compilers or for those who optimize the average run of programs by a better knowledge of the distribution of terms. Perhaps the reason for this ignorance lies in the surprising form of the recurrences. Indeed, due to the presence of bound variables, the recurrence does not work in the way mathematicians expect and are used to. Thus none of the methods used in the reference book of Flajolet and Sedgewick [5] applies. Why is that? In what follows we compute the number of lambda terms (and of normal forms) of size $n$ with at most $m$ distinct free variables. Denoting the number of such terms by $T_{n,m}$, the formula for $T_{n,m}$ contains $T_{n-1,m+1}$, and this growth of $m$ makes the formula averse to treatments by generating functions and classical analytic combinatorics. We notice that for a given $n$ the expression for $T_{n,m}$ is a polynomial in $m$. These polynomials can be described inductively and their coefficients are given by recurrence formulas. These formulas are still complex, but can be used to compute the constant coefficients, which correspond to the numbers of closed lambda terms. For instance, the leading coefficients of the polynomials are the well known Catalan numbers, which count binary trees.

In order to find the recurrence formula for the number of λ-terms of a given size, we make use of the representation of variables in λ-terms by de Bruijn indices. Recall that a de Bruijn index is a natural number which replaces a term variable and counts the λ's encountered on the way between the variable and the λ which binds it. In this paper, we assume the combinatorial model in which each occurrence of abstraction or application has size 1, while variables (de Bruijn indices) have size 0. This is a realistic model of the complexity of λ-terms and allows us to derive the recurrences very naturally. Since we manipulate big numbers, we need an efficient computer algebra system for the computation. We have chosen PARI/GP [16], a package of the software SAGE [15].

From the formula for counting λ-terms we can derive a one-to-one assignment of the numbers in the interval $[1..P_n(m)]$ to the terms of size $n$ with at most $m$ distinct free indices. From this correspondence, we can develop a program for generating λ-terms, more precisely for building the λ-term associated with a number in the interval $[1..P_n(m)]$. If we pick a random number in the interval, then we get a random term of size $n$ with at most $m$ distinct free variables. Besides the interest of such a random generation for applications like testing, this allows us to compute practical values of parameters by Monte-Carlo methods. Overall, we are able to build a random generator for simply typed terms. Unlike the method used traditionally [13], which consists in unfolding the typing tree, we generate random λ-terms and test their typability until we find a simply typed term. This method allows us to generate, on a laptop, simply typed λ-terms up to size 50. We also use this method to describe the distribution of well-typed terms among plain terms and of well-typed normal forms among plain normal forms. This shows that terms starting with abstractions are more likely to be typable than terms starting with applications, the phenomenon being more manifest on normal forms. From this, we derive a way to generate large typed terms, up to size 200, with a biased randomness.
Related works

There are very few works on counting lambda terms, whereas counting first order terms is a classical domain of combinatorics. Apparently the first traces of counting expressions with (unbound) variables can be attributed to Hipparchus of Rhodes (c. 190-120 BC) (see [5] p. 68). The book of Flajolet and Sedgewick [5] is the reference on this subject. Concerning counting λ-terms, we can cite only four works. [3] and [1] study the asymptotic behavior of formulas for counting lambda terms. Strictly speaking they do not exhibit a recurrence formula for counting. In particular, David et al. [3] only bound the numbers of λ-terms from above and from below in order to get information about the distribution of families of terms. For instance, they prove that "asymptotically almost all λ-terms are strongly normalizing". In [10] the second author of the present paper proposes formulas for counting λ-terms in the case of variables of weight 1, with more complex formulas and fewer results. On the other hand, Christophe Raffalli proposed a formula for counting closed λ-terms, which he derives from the formula for counting λ-terms with exactly $m$ distinct free variables, whereas in this paper we count terms with at most $m$ such variables. His formula appears only in the On-line Encyclopedia of Integer Sequences under number A135501 and is much more complicated, with three embedded levels of Σ's. He considers weight 1 for the variables. His formula can be easily adapted to variables of weight 0, but remains complex (see Section 4 and Appendix A).

As concerns random generation, [17] proposed algorithms for randomly generating untyped λ-terms in the spirit of the counting formula of Raffalli. [12, 13] use generation of λ-terms to test Haskell compilers. Palka acknowledges that, due to his method, he cannot guarantee the randomness of his generator (see the discussion in [12] p. 21 and p. 45). Nonetheless, he found eight failures and four bugs in the Glasgow Haskell Compiler, which shows the interest of the method. [14] study the feasibility of generic programming for the enumeration of typed terms. The given examples are of size 4 or 5, no realistic examples are provided, randomness is not addressed and the authors confess that their algorithm is not efficient. Knowing that there are 11807 simply typed closed terms of size 7, one wonders about the actual use of such an enumeration, and it seems unrealistic to address enumeration for larger sizes. The "related work" section of [14] covers similar approaches, which all consist in cutting branches. They all fail to generate random terms. A presentation of tree-like structure generation and a history of combinatorial generation is given in [8].

Structure of the paper 

According to its title, the paper is divided into two parts: one focuses on counting terms and its mathematical treatment, the other on term generation and its applications. The first part (Sections 2 to 6) is devoted to the formulas counting λ-terms. In Section 2 we study polynomials giving the numbers of terms of size $n$ with at most $m$ distinct free variables, especially formulas giving the coefficients of the polynomials. In Section 3 we show that the numbers of $i$-contexts give a combinatorial interpretation of the coefficients of the polynomials and yield a new formula for counting the closed terms of size $n$. If we add formulas for counting lambda terms of size $n$ with exactly $m$ variables, we have three formulas of three different origins for counting closed terms, which we describe and compare in Section 4. In Section 5 we derive generating functions and asymptotic values for these coefficients. In Section 6 we give a formula for counting normal forms. In the second part, i.e., in Section 7 and Section 8, we propose programs to generate untyped and typed terms and normal forms. Section 9 is devoted to experimental results. The SAGE script related to this paper can be found at

https://dl.dropbox.com/u/2518969/LambdaTermsEnumerationAndGeneration.sws

and raw statistics can be found at

https://dl.dropbox.com/u/2518969/Statistics.txt



2 Counting terms with at most m variables 

We represent terms by de Bruijn indices [4], which means that variables are represented by numbers $1, 2, \ldots, m, \ldots$, where an index, for instance $k$, is the number of λ's above the location of the index and up to the λ that binds the variable (that λ included), in a representation of λ-terms by trees. For instance, the term with variables λx.λy.x y is represented by the term with de Bruijn indices λλ2 1. The variable x is bound by the top λ. Above the occurrence of x, there are two λ's, therefore x is represented by 2; from the occurrence of y, we count just the λ that binds y, so y is represented by 1. In what follows we will call terms the untyped terms with de Bruijn indices, and we will often speak indifferently of variables and (de Bruijn) indices. Assume that not all the indices are bound. In other words, there may be indices that do not correspond to surrounding λ's; we call them "free". Here is the convention on the interval of "free" indices which appear in a term $t$. An interval is a set $\mathcal{I}(m) = \{1, 2, \ldots, m\}$ of indices. If $t$ is an index $i$, the interval of indices of $t$ is any interval $[1, \ldots, m]$ with $1 \le i \le m$. Now assume that the interval of free indices of $t$ is $[1, \ldots, m+1]$; then the interval of free indices of λt is $[1, \ldots, m]$, because the index 1 has been bound and the others are assumed to decrease by one. If the interval of indices of $t$ and $s$ is $[1, \ldots, m]$, then the interval of indices of $s\,t$ is $[1, \ldots, m]$. For instance, the interval of the term λ3 1 is the interval $[1, 2, \ldots, m]$ for any $m \ge 2$.
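The translation from named variables to de Bruijn indices described above can be sketched as follows. This is only an illustration under assumptions of our own: the tuple encoding of terms and the function name `to_de_bruijn` are hypothetical and do not come from the paper's SAGE script.

```python
# Hypothetical encodings (not from the paper):
#   named terms:     ("var", name) | ("lam", name, body) | ("app", fun, arg)
#   de Bruijn terms: int index     | ("lam", body)       | ("app", fun, arg)

def to_de_bruijn(term, binders=()):
    """Translate a closed named term into a de Bruijn term.

    `binders` lists the names bound above the current position,
    innermost first, so the index of a variable is its position + 1.
    """
    tag = term[0]
    if tag == "var":
        # index() returns the first match, i.e. the innermost binder
        return binders.index(term[1]) + 1
    if tag == "lam":
        return ("lam", to_de_bruijn(term[2], (term[1],) + binders))
    return ("app", to_de_bruijn(term[1], binders), to_de_bruijn(term[2], binders))


# The example of the text: the named term (lambda x. lambda y. x y)
example = ("lam", "x", ("lam", "y", ("app", ("var", "x"), ("var", "y"))))
print(to_de_bruijn(example))
```

On the example of the text, x receives index 2 and y receives index 1, in agreement with the counting of λ's explained above.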

Let us denote the set of terms of size $n$ with at most $m$ "free" de Bruijn indices (with $[1, \ldots, m]$ as interval of indices) by $\mathcal{T}_{n,m}$. A term from $\mathcal{T}_{0,m}$ is a de Bruijn index. A term from $\mathcal{T}_{n+1,m}$ is either an abstraction on a term with at most $m+1$ indices, i.e., on a term in $\mathcal{T}_{n,m+1}$, or an application of a term in $\mathcal{T}_{i,m}$ to a term in $\mathcal{T}_{n-i,m}$. We can write, using @ as the application symbol,

$$\mathcal{T}_{0,m} = \mathcal{I}(m), \qquad \mathcal{T}_{n+1,m} = \lambda\,\mathcal{T}_{n,m+1} \;\uplus\; \biguplus_{i=0}^{n} \mathcal{T}_{i,m} \mathbin{@} \mathcal{T}_{n-i,m}.$$

We assume that the operators λ and @ have size 1 and that de Bruijn indices have size 0. From this, we get the following two equations specifying $T_{n,m}$:

$$T_{0,m} = m, \qquad T_{n+1,m} = T_{n,m+1} + \sum_{i=0}^{n} T_{i,m}\,T_{n-i,m}.$$

This means that there are $m$ terms of size 0 with at most $m$ free de Bruijn indices, which are the terms that are just these indices. Terms of size $n$ with at most $m$ de Bruijn indices are either abstractions on a term of size $n-1$ with at most $m+1$ indices, or applications of two terms with at most $m$ indices whose sizes sum to $n-1$.

Figure 1: Values of $T_{n,m}$ for $n$ up to 14 and $m$ up to 6

n   P_n(m)
0   $m$
1   $m^2 + m + 1$
2   $2m^3 + 3m^2 + 5m + 3$
3   $5m^4 + 10m^3 + 22m^2 + 25m + 14$
4   $14m^5 + 35m^4 + 94m^3 + 154m^2 + 163m + 82$
5   $42m^6 + 126m^5 + 396m^4 + 838m^3 + 1277m^2 + 1235m + 579$
6   $132m^7 + 462m^6 + 1654m^5 + 4260m^4 + 8384m^3 + 11791m^2 + 10707m + 4741$
7   $429m^8 + 1716m^7 + 6868m^6 + 20742m^5 + 49720m^4 + 90896m^3 + 120628m^2 + 104055m + 43977$
8   $1430m^9 + 6435m^8 + 28396m^7 + 98028m^6 + 275886m^5 + 617096m^4 + 1068328m^3 + 1352268m^2 + 1117955m + 454283$

Figure 2: The first eight polynomials $P_n$

As we said in the introduction, the first 11 values of $T_{n,0}$ are:

0, 1, 3, 14, 82, 579, 4741, 43977, 454283, 5159441, 63782411.
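The two equations specifying $T_{n,m}$ translate directly into a memoized recursive function. The following is a minimal sketch in plain Python (the name `count_terms` is ours, and the paper's own computations rely on SAGE and PARI/GP rather than on this code); Python integers have arbitrary precision, so large sizes pose no overflow problem.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_terms(n, m):
    """T_{n,m}: number of terms of size n with at most m free indices."""
    if n == 0:
        return m  # the m indices themselves
    # an abstraction binds one more index in its body of size n - 1
    abstractions = count_terms(n - 1, m + 1)
    # an application splits the remaining size n - 1 between its two parts
    applications = sum(count_terms(i, m) * count_terms(n - 1 - i, m)
                       for i in range(n))
    return abstractions + applications


print([count_terms(n, 0) for n in range(11)])
```

The memoization matters: the recurrence revisits the same pairs $(n, m)$ many times, and caching keeps the number of distinct evaluations quadratic in $n$.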



Figure 1 gives all the values of $T_{n,m}$ for $n$ up to 14 and $m$ up to 6. For instance, there is 1 closed term of size 1, namely λ1, there are 3 closed terms of size 2, namely λλ1, λλ2, λ(1 1), and there are 14 closed terms of size 3, namely

λλλ1, λλλ2, λλλ3, λλ(1 1), λλ(1 2), λλ(2 1), λλ(2 2), λ(1 λ1),
λ(1 λ2), λ(1 (1 1)), λ((λ1) 1), λ((λ2) 1), λ((1 1) 1), (λ1) λ1.

Notice that in Section 7 we describe how to assign a number to a term and therefore how to list terms with increasing numbers. The above terms have been listed in that order.

For every $n \ge 0$, we can associate with $T_{n,m}$ a polynomial $P_n(m)$ in $m$. First, let us define polynomials $P_n$ in the following recursive way:

$$P_0(m) = m, \qquad P_{n+1}(m) = P_n(m+1) + \sum_{i=0}^{n} P_i(m)\,P_{n-i}(m).$$

The sequence $(P_n(0))_{n \ge 0}$ corresponds to the sequence $(T_{n,0})_{n \ge 0}$ enumerating closed lambda terms. The first eight polynomials are given in Figure 2.

This means that the constant coefficient of a polynomial $P_n(m)$ is exactly the number of closed lambda terms of size $n$. We propose a way of computing the coefficients of these polynomials.
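The recursive definition of the $P_n$ can be carried out symbolically with nothing more than coefficient lists. The sketch below is an illustration with hypothetical helper names (`poly_add`, `poly_mul`, `poly_shift`, `polynomials`); it assumes polynomials are stored with the constant coefficient first.

```python
def poly_add(p, q):
    """Sum of two polynomials given as coefficient lists (constant first)."""
    size = max(len(p), len(q))
    p = p + [0] * (size - len(p))
    q = q + [0] * (size - len(q))
    return [a + b for a, b in zip(p, q)]


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_shift(p):
    """Coefficients of p(m + 1), computed by Horner's scheme in (m + 1)."""
    out = [p[-1]]
    for coeff in reversed(p[:-1]):
        out = poly_add(poly_mul(out, [1, 1]), [coeff])
    return out


def polynomials(count):
    """Return [P_0, ..., P_{count-1}] as coefficient lists."""
    polys = [[0, 1]]  # P_0(m) = m
    for n in range(count - 1):
        nxt = poly_shift(polys[n])  # P_n(m + 1)
        for i in range(n + 1):
            nxt = poly_add(nxt, poly_mul(polys[i], polys[n - i]))
        polys.append(nxt)
    return polys


for n, poly in enumerate(polynomials(4)):
    print(n, poly)
```

Reading each list from right to left gives the rows of Figure 2, and the first entry of each list is the number of closed terms of the corresponding size.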
Lemma 1 For every $n$, the degree of the polynomial $P_n$ is equal to $n+1$.

Proof: The result follows immediately by induction on $n$ from the definition of $P_n$. □

For $i > 0$ and $n \ge 0$, let us denote by $p_n^i$ the $i$-th leading coefficient of the polynomial $P_n$ (with $p_n^i = 0$ for $i > n+2$), i.e., we have

$$P_n(m) = p_n^1 m^{n+1} + p_n^2 m^{n} + \ldots + p_n^i m^{n+2-i} + \ldots + p_n^{n+1} m + p_n^{n+2}.$$

Lemma 2 For every $n \ge 0$ and $i > 0$,

$$p_0^1 = 1, \qquad p_0^i = 0 \ \text{ for } i > 1,$$

$$p_{n+1}^i = \sum_{j=0}^{i-2} \binom{n+1-j}{i-2-j}\, p_n^{j+1} + \sum_{k=1}^{i} \sum_{j=0}^{n} p_j^k\, p_{n-j}^{i+1-k}.$$

Proof: Since $P_0(m) = m$, the equations from the first line in the above lemma are trivial.

The $i$-th leading coefficient in the polynomial $P_{n+1}(m)$ is equal to the sum of the coefficients standing at $m^{n+3-i}$ in the polynomials $P_n(m+1)$ and $\sum_{j=0}^{n} P_j(m) P_{n-j}(m)$. The first of these polynomials, $P_n(m+1)$, is as follows:

$$p_n^1 (m+1)^{n+1} + \ldots + p_n^{i-1} (m+1)^{n+3-i} + \ldots + p_n^{n+2},$$

therefore the coefficient of $m^{n+3-i}$ in $P_n(m+1)$ is equal to

$$\binom{n+1}{i-2} p_n^1 + \binom{n}{i-3} p_n^2 + \ldots + \binom{n+3-i}{0} p_n^{i-1} = \sum_{j=0}^{i-2} \binom{n+1-j}{i-2-j}\, p_n^{j+1}.$$

In the case of the second polynomial, $\sum_{j=0}^{n} P_j(m) P_{n-j}(m)$, we have

$$\sum_{j=0}^{n} \left(p_j^1 m^{j+1} + \ldots + p_j^k m^{j+2-k} + \ldots + p_j^{j+2}\right) \cdot \left(p_{n-j}^1 m^{n-j+1} + \ldots + p_{n-j}^{l} m^{n-j+2-l} + \ldots + p_{n-j}^{n-j+2}\right),$$

therefore the coefficient of $m^{n+3-i}$ in $\sum_{j=0}^{n} P_j(m) P_{n-j}(m)$ is equal to

$$\sum_{k=1}^{i} \sum_{j=0}^{n} p_j^k\, p_{n-j}^{i+1-k}.$$

□
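The recurrence of Lemma 2 can be checked mechanically against Figure 2. Below is a minimal sketch, assuming Python 3.8 or later for `math.comb`; the function name `p` mirrors the notation $p_n^i$ and is otherwise our own choice.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p(n, i):
    """p_n^i: the i-th leading coefficient of P_n (zero unless 1 <= i <= n + 2)."""
    if i < 1 or i > n + 2:
        return 0
    if n == 0:
        return 1 if i == 1 else 0
    prev = n - 1  # Lemma 2 expresses p_{prev+1}^i through coefficients of size <= prev
    # contribution of P_prev(m + 1)
    shifted = sum(comb(prev + 1 - j, i - 2 - j) * p(prev, j + 1)
                  for j in range(i - 1))
    # contribution of the products P_j(m) * P_{prev-j}(m)
    products = sum(p(j, k) * p(prev - j, i + 1 - k)
                   for k in range(1, i + 1)
                   for j in range(prev + 1))
    return shifted + products


# coefficients of P_3, leading coefficient first
print([p(3, i) for i in range(1, 6)])
```

With this indexing, `p(n, n + 2)` is the constant coefficient of $P_n$, i.e., the number of closed terms of size $n$, and `p(n, 1)` is the leading coefficient.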
3 Counting contexts

In λ-calculus, an $i$-context is a closed term with $i$ holes. We consider that a hole has size 0 and we assume the holes are numbered $1, \ldots, i$ as we meet them when traversing the term from left to right. For instance, if we represent the holes by [ ], then (λ1[ ])λλ[ ]2 is a 2-context of size 6 (three abstractions and three applications); its holes are numbered as follows: (λ1[ ]_1)λλ[ ]_2 2. The 0-contexts are the closed terms. There is only one 1-context of size 0, namely the hole itself, and no $i$-context of size 0 for $i \neq 1$. Let us write $c_{n,i}$ for the number of $i$-contexts of size $n$. One notices that

$$c_{0,1} = 1, \qquad c_{0,i} = 0 \ \text{ for } i \neq 1.$$

Let us now see how we build a context from smaller contexts.

By abstraction, an $i$-context of size $n+1$ can be built from a $j$-context of size $n$ and a set of $j-i$ holes among the $j$ holes of the $j$-context where one puts variables (or indices) to be abstracted. There are $\binom{j}{i} c_{n,j}$ such $i$-contexts. One has to sum those quantities from $j = i$ to $j = n+1$ to get the number of $i$-contexts built this way.

By application, an $i$-context of size $n+1$ can be built by applying a context to another context, i.e., a $j$-context of size $k$ applied to an $(i-j)$-context of size $n-k$ (recall that the application operator has size 1). To get all the contexts built this way, one has to sum from $j = 0$ to $j = i$ and from $k = 0$ to $k = n$.

Hence we get the formula:

$$c_{n+1,i} = \sum_{j=i}^{n+1} \binom{j}{i}\, c_{n,j} + \sum_{j=0}^{i} \sum_{k=0}^{n} c_{k,j}\, c_{n-k,i-j}. \qquad (*)$$

From contexts we can see how we can build terms. More precisely, from an $i$-context of size $n$ and a map $f$ from $[1..i]$ to $[1..m]$, we can insert the index $f(j)$ in the $j$-th hole to build a term of size $n$ with $i$ occurrences of free variables taken among $m$ ones. There are $c_{n,i}\, m^i$ such terms. Therefore

$$T_{n,m} = c_{n,n+1}\, m^{n+1} + \ldots + c_{n,i}\, m^{i} + \ldots + c_{n,0}$$

is the number of λ-terms of size $n$ with at most $m$ variables, which is the polynomial $P_n$. In particular $c_{n,n+2-i} = p_n^i$. The coefficients $c_{n,i}$ of the polynomials $P_n$ count the $i$-contexts of size $n$. We see that $c_{n,i} = 0$ when $i > n+1$.

The case $i = n+2$. In the case when $i = n+2$, using the fact that $c_{n,i} = 0$ when $i > n+1$, the equation (*) boils down to:

$$c_{n+1,n+2} = \sum_{k=0}^{n} c_{k,k+1}\, c_{n-k,n-k+1},$$

which is characteristic of the Catalan numbers. Indeed, $(n+1)$-contexts of size $n$ have no abstraction, only applications, and are therefore binary trees.
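Equation (*) can be evaluated in the same memoized style as the other recurrences. The sketch below is illustrative only (the name `contexts` is ours; `math.comb` requires Python 3.8 or later); it lets one observe both that $c_{n,0}$ counts the closed terms and that $c_{n,n+1}$ follows the Catalan numbers.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def contexts(n, i):
    """c_{n,i}: number of i-contexts of size n, following equation (*)."""
    if i < 0:
        return 0
    if n == 0:
        return 1 if i == 1 else 0  # the only context of size 0 is a single hole
    prev = n - 1
    # abstraction: fill j - i of the j holes of a j-context with the bound index
    by_abstraction = sum(comb(j, i) * contexts(prev, j)
                         for j in range(i, prev + 2))
    # application: split both the size and the holes between the two parts
    by_application = sum(contexts(k, j) * contexts(prev - k, i - j)
                         for j in range(i + 1)
                         for k in range(prev + 1))
    return by_abstraction + by_application


print([contexts(n, 0) for n in range(9)])      # closed terms
print([contexts(n, n + 1) for n in range(7)])  # contexts without abstraction
```

The intermediate columns can be compared with Figure 2: `contexts(n, i)` is the coefficient of $m^i$ in $P_n$.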
4 Three formulas for counting closed terms

We have found three formulas to compute the number of closed terms of size $n$. Let us summarize them:

Case $m = 0$ for terms with at most $m$ distinct free variables

$T_{n,0}$ where

$$T_{0,m} = m, \qquad T_{n+1,m} = T_{n,m+1} + \sum_{i=0}^{n} T_{i,m}\, T_{n-i,m}.$$

This formula is clearly the simplest. Its simplicity, one sum and no binomial, allows unfolding it and on this basis building a program for term generation.

Case $m = 0$ for terms with exactly $m$ distinct free variables

$f_{n,0}$ where

$$f_{0,0} = 0, \qquad f_{0,1} = 1, \qquad f_{n,m} = 0 \ \text{ if } m > n+1,$$

$$f_{n+1,m} = f_{n,m} + f_{n,m+1} + \sum_{p=0}^{n} \sum_{c=0}^{m} \sum_{k=0}^{m-c} \binom{m}{c} \binom{m-c}{k}\, f_{p,k+c}\, f_{n-p,m-k}.$$

This formula is the most complex. The way it can be derived is given in Appendix A.1.

In Appendix A.2 we give a simple connection between the $T_{n,m}$'s and the $f_{n,m}$'s.
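To see the second formula at work, one can evaluate it naively and compare $f_{n,0}$ with the closed-term counts of Figure 2. The sketch below is a direct transcription under our own naming (`exact`); in the triple sum, $c$ is read as the number of variables shared by both parts of an application and $k$ as the number of variables occurring only in the left part, which is our interpretation of the indices rather than a statement taken from the appendix.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def exact(n, m):
    """f_{n,m}: number of terms of size n with exactly m distinct free variables."""
    if n == 0:
        return 1 if m == 1 else 0  # a single variable
    if m > n + 1:
        return 0
    prev = n - 1
    # abstraction: the bound variable either occurs in the body or does not
    total = exact(prev, m) + exact(prev, m + 1)
    # application: c shared variables, k variables in the left part only
    for size_left in range(prev + 1):
        for c in range(m + 1):
            for k in range(m - c + 1):
                total += (comb(m, c) * comb(m - c, k)
                          * exact(size_left, k + c)
                          * exact(prev - size_left, m - k))
    return total


print([exact(n, 0) for n in range(9)])
```

The three nested sums make this evaluation noticeably slower than the first formula, which illustrates why the first one is preferred as a basis for generation.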



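0-contexts

$c_{n,0}$ where

$$c_{0,1} = 1, \qquad c_{0,i} = 0 \ \text{ for } i \neq 1,$$

$$c_{n+1,i} = \sum_{j=i}^{n+1} \binom{j}{i}\, c_{n,j} + \sum_{j=0}^{i} \sum_{k=0}^{n} c_{k,j}\, c_{n-k,i-j}.$$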



5 Generating functions for the coefficients of the $P_n$'s

For every positive integer $i$, let us denote by $a_i$ the generating function for the sequence $(p_n^i)_{n \ge 0}$, i.e.,

$$a_i(z) = \sum_{n=0}^{\infty} p_n^i\, z^n = \sum_{n=0}^{\infty} c_{n,n+2-i}\, z^n.$$

The $p_n^i$'s count the number of contexts of size $n$ having $n+2-i$ holes, i.e., having almost as many holes as their size, where "almost as many as" means "except a fixed number $i-2$". For the sake of clarity, instead of writing $a_i(z)$ we sometimes simply write $a_i$.

In order to compute the functions $a_i$, we apply the following basic facts about generating functions.
Fact 3 Let $f$ and $g$ be generating functions for sequences $(f_n)_{n \ge 0}$ and $(g_n)_{n \ge 0}$, respectively. Then

(i) the generating function for the sequence $\left(\binom{n}{k} f_n\right)_{n \ge 0}$, where $k$ is a fixed positive integer, is given by $\frac{z^k f^{(k)}(z)}{k!}$,

(ii) the generating function for the sequence $\left(\sum_{i=0}^{n} f_i\, g_{n-i}\right)_{n \ge 0}$ is given by $f \cdot g$,

(iii) the generating function for the sequence $\left(\binom{n-i}{j} f_n\right)_{n \ge 0}$, where $i > 0$ and $j \ge 0$, is given by

$$\sum_{k=0}^{j} (-1)^k \binom{k+i-1}{i-1} \frac{z^{j-k} f^{(j-k)}(z)}{(j-k)!}.$$

Proof: Items (i) and (ii) can be found, e.g., in Chapter 7 of [6]. The third part follows from (i) and the following equality:

$$\binom{n-i}{j} = \sum_{k=0}^{j} (-1)^k \binom{k+i-1}{i-1} \binom{n}{j-k},$$

which holds for every $n$, $i > 0$ and $j \ge 0$. This equality can be easily derived from two equalities known as "upper negation" and "Vandermonde convolution", which can be found in Table 174 of [6]. □

Now we are ready to provide a recurrence for the functions $a_i$.

Theorem 4 The following equations are valid:

$$a_1 = z a_1^2 + 1, \qquad a_1(0) = 1,$$

$$a_2 = z a_1 + 2 z a_1 a_2,$$

$$a_i = z^{i-1} \frac{a_1^{(i-2)}}{(i-2)!} + z^{i-2} \frac{a_1^{(i-3)}}{(i-3)!} + z^{i-2} \frac{a_2^{(i-3)}}{(i-3)!} + z \sum_{j=1}^{i-3} \sum_{k=0}^{i-3-j} (-1)^k \binom{k+j-1}{j-1} z^{i-3-j-k} \frac{a_{j+2}^{(i-3-j-k)}}{(i-3-j-k)!} + z \sum_{j=1}^{i} a_j\, a_{i-j+1}, \quad \text{for } i > 2.$$

Proof: All these equations follow from Lemma 2 and Fact 3. □
$$a_1(z) = \frac{1}{2z} - \frac{(1-4z)^{1/2}}{2z}$$

$$a_2(z) = -\frac{1}{2} + \frac{1}{2\,(1-4z)^{1/2}}$$

$$a_3(z) = \frac{z}{1-4z} + \frac{z^2}{(1-4z)^{3/2}}$$

$$a_4(z) = \frac{3z^2}{(1-4z)^2} + \frac{z^3}{(1-4z)^{5/2}}$$

$$a_5(z) = \frac{z^3\,(4z+9)}{(1-4z)^3} + \frac{z^3\,(z^2 - 19z + 5)}{(1-4z)^{7/2}}$$

$$a_6(z) = \frac{z^4\,(24z+31)}{(1-4z)^4} + \frac{z^4\,(3z^2 - 203z + 51)}{(1-4z)^{9/2}}$$

$$a_7(z) = \frac{z^5\,(16z^2 - 128z + 181)}{(1-4z)^5} + \frac{z^5\,(2z^3 - 194z^2 - 1541z + 398)}{(1-4z)^{11/2}}$$

Figure 3: The generating functions for the coefficients of the polynomials $P_n(m)$



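Notice that the $a_i$'s can be computed by induction. Indeed, $a_i$ occurs twice on the right-hand side of the last equation of Theorem 4 (for $j = 1$ and for $j = i$ in the last sum), and we have:

$$a_i\,(1 - 2 z a_1) = z^{i-1} \frac{a_1^{(i-2)}}{(i-2)!} + z^{i-2} \frac{a_1^{(i-3)}}{(i-3)!} + z^{i-2} \frac{a_2^{(i-3)}}{(i-3)!} + z \sum_{j=1}^{i-3} \sum_{k=0}^{i-3-j} (-1)^k \binom{k+j-1}{j-1} z^{i-3-j-k} \frac{a_{j+2}^{(i-3-j-k)}}{(i-3-j-k)!} + z \sum_{j=2}^{i-1} a_j\, a_{i-j+1}.$$

Since $1 - 2 z a_1 = \sqrt{1-4z}$ we get:

$$a_i = \left( z^{i-1} \frac{a_1^{(i-2)}}{(i-2)!} + z^{i-2} \frac{a_1^{(i-3)}}{(i-3)!} + z^{i-2} \frac{a_2^{(i-3)}}{(i-3)!} + z \sum_{j=1}^{i-3} \sum_{k=0}^{i-3-j} (-1)^k \binom{k+j-1}{j-1} z^{i-3-j-k} \frac{a_{j+2}^{(i-3-j-k)}}{(i-3-j-k)!} + z \sum_{j=2}^{i-1} a_j\, a_{i-j+1} \right) \Big/ \sqrt{1-4z}. \qquad (\dagger)$$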



/ 



Corollary 5 Exact formulas for the functions $a_1$ to $a_7$ are given in Figure 3.

Proof: Let us first compute the function $a_1$ which, according to Theorem 4, is given by

$$a_1 = z a_1^2 + 1, \qquad a_1(0) = 1.$$

By solving this equation, we obtain $a_1(z) = \frac{1 - \sqrt{1-4z}}{2z}$, which is exactly the generating function for Catalan numbers (see, e.g., Chapter I.1 of [5]).

Now, let us notice that on the basis of Theorem 4 all the other functions can be obtained by tedious, however elementary, computations. In order to get exact values we used the SAGE software [15]. □
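A closed form such as the one for $a_5$ can be cross-checked without a computer algebra system by expanding it as a power series with exact rational arithmetic. The sketch below is such a check, written under our own assumptions: it takes $a_5(z) = \frac{z^3(4z+9)}{(1-4z)^3} + \frac{z^3(z^2-19z+5)}{(1-4z)^{7/2}}$ as the expression to test, and uses the standard expansion $[z^n](1-4z)^{-\alpha} = 4^n \binom{n+\alpha-1}{n}$. The helper names are hypothetical.

```python
from fractions import Fraction


def power_series(alpha, terms):
    """Coefficients of (1 - 4z)^(-alpha) up to z^(terms - 1)."""
    coeffs = [Fraction(1)]
    for n in range(1, terms):
        # ratio of consecutive coefficients: 4 (alpha + n - 1) / n
        coeffs.append(coeffs[-1] * 4 * (alpha + n - 1) / n)
    return coeffs


def series_mul(poly, series, terms):
    """Product of a polynomial and a series, truncated to `terms` coefficients."""
    out = [Fraction(0)] * terms
    for i, a in enumerate(poly):
        for j, b in enumerate(series):
            if i + j < terms:
                out[i + j] += a * b
    return out


def a5_series(terms):
    """Series coefficients of the closed form assumed for a_5."""
    q = [0, 0, 0, 9, 4]        # z^3 (4z + 9), constant coefficient first
    r = [0, 0, 0, 5, -19, 1]   # z^3 (z^2 - 19z + 5)
    left = series_mul(q, power_series(Fraction(3), terms), terms)
    right = series_mul(r, power_series(Fraction(7, 2), terms), terms)
    return [int(x + y) for x, y in zip(left, right)]


print(a5_series(9))
```

The coefficient of $z^n$ in the output is to be compared with $p_n^5$, i.e., with the coefficient of $m^{n-3}$ in the polynomial $P_n$ of Figure 2.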

Let $[z^n]f(z)$ denote the coefficient of $z^n$ in the formal power series $f(z) = \sum_{n=0}^{\infty} f_n z^n$. The theorem below (Theorem VI.1 of [5]) serves as a powerful tool that allows us to estimate coefficients of certain functions that frequently appear in combinatorial considerations.

Fact 6 Let $\alpha$ be an arbitrary complex number in $\mathbb{C} \setminus \mathbb{Z}_{\le 0}$. The coefficient of $z^n$ in

$$f(z) = (1-z)^{-\alpha}$$

admits the following asymptotic expansion:

$$[z^n]f(z) \sim \frac{n^{\alpha-1}}{\Gamma(\alpha)} \left( 1 + \frac{\alpha(\alpha-1)}{2n} + \frac{\alpha(\alpha-1)(\alpha-2)(3\alpha-1)}{24n^2} + \frac{\alpha^2(\alpha-1)^2(\alpha-2)(\alpha-3)}{48n^3} + O\!\left(\frac{1}{n^4}\right) \right),$$

where $\Gamma$ is the Euler Gamma function defined for $\Re(\alpha) > 0$ as

$$\Gamma(\alpha) := \int_0^{\infty} e^{-t}\, t^{\alpha-1}\, dt.$$


We can prove the following approximation.

Proposition 7

$$a_i(z) \sim \frac{C_{i-2}}{2^{3i-5}\,(1-4z)^{(2i-3)/2}} \quad \text{when } z \to \frac{1}{4},$$

where $C_i$ is the $i$-th Catalan number.

Proof: In this proof, when we write $\sim$ or "is of order" we mean when $z \to \frac{1}{4}$. We prove the result by induction using Theorem 4. The result is true for $i = 2$, since $a_2 = z a_1 / \sqrt{1-4z}$ and $z a_1 \to \frac{1}{2}$. For $i > 2$ and $j < i$, assume that $a_j(z)$ is of order $(1-4z)^{-(2j-3)/2}$ and look at equation (†) to prove that $a_i(z)$ is of order $(1-4z)^{-(2i-3)/2}$.

Notice that the $i$-th derivative of $a_1$ is of order $(1-4z)^{-(2i-1)/2}$, hence its $(i-2)$-th derivative is of order $(1-4z)^{-(2i-5)/2}$ and its $(i-3)$-th derivative is of order $(1-4z)^{-(2i-7)/2}$. Similarly, the $i$-th derivative of $a_2$ is of order $(1-4z)^{-(2i+1)/2}$; hence its $(i-3)$-th derivative is of order $(1-4z)^{-(2i-5)/2}$.

By induction, for $j \le i-3$, $a_{j+2}$ is of order $(1-4z)^{-(2j+1)/2}$. It is differentiated at most $i-3-j$ times, hence the items in the double sum are of order at most $(1-4z)^{-(2i-5)/2}$.

Hence the first four terms in (†) do not contribute to the asymptotic value of $a_i(z)$. Therefore the contribution to the asymptotic value is given by the products $a_j\, a_{i-j+1}$, which are of order $(1-4z)^{-(i-2)}$, and when multiplied by $\frac{1}{\sqrt{1-4z}}$, the last sum is of order $(1-4z)^{-(2i-3)/2}$.

Let us call $K_i$ the multiplicative coefficient $C_{i-2}/2^{3i-5}$ of $(1-4z)^{-(2i-3)/2}$. One notices that $K_2 = \frac{1}{2} = \frac{C_0}{2^{3 \cdot 2 - 5}}$. The sum $z \sum_{j=2}^{i-1} a_j\, a_{i-j+1}$ shows the inductive part. Indeed when $z \to \frac{1}{4}$:

$$\frac{1}{4} \sum_{j=2}^{i-1} K_j\, K_{i-j+1} = \frac{1}{4} \sum_{j=2}^{i-1} \frac{C_{j-2}}{2^{3j-5}} \cdot \frac{C_{i-j-1}}{2^{3(i-j+1)-5}} = \frac{1}{2^{3i-5}} \sum_{j=0}^{i-3} C_j\, C_{i-3-j} = \frac{C_{i-2}}{2^{3i-5}} = K_i.$$

□

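Theorem 8

$$[z^n]\,a_k(z) \sim \frac{4^n\, n^{(2k-5)/2}}{2^{k-1}\,(k-1)!\,\sqrt{\pi}}\; \Psi(n,k),$$

where

$$\Psi(n,k) = 1 + \frac{(2k-3)(2k-5)}{8n} + \frac{(2k-3)(2k-5)(2k-7)(6k-11)}{384n^2} + \frac{(2k-3)^2(2k-5)^2(2k-7)(2k-9)}{3072n^3} + O\!\left(\frac{1}{n^4}\right).$$

Proof: First recall that:

$$\Gamma\!\left(\frac{2k-3}{2}\right) = \Gamma\!\left((k-2) + \frac{1}{2}\right) = \frac{(2(k-2))!\,\sqrt{\pi}}{4^{k-2}\,(k-2)!}.$$

Now using Fact 6, we can compute the principal part:

$$[z^n]\,a_k(z) \sim \frac{C_{k-2}}{2^{3k-5}}\, [z^n](1-4z)^{-(2k-3)/2}$$

$$\sim \frac{C_{k-2}}{2^{3k-5}} \cdot \frac{4^n\, n^{(2k-5)/2}}{\Gamma((2k-3)/2)}$$

$$= \frac{C_{k-2}\,(k-2)!\,2^{2(k-2)}}{2^{3k-5}\,(2(k-2))!\,\sqrt{\pi}}\; 4^n\, n^{(2k-5)/2}$$

$$= \frac{C_{k-2}\,(k-2)!}{2^{k-1}\,(2(k-2))!\,\sqrt{\pi}}\; 4^n\, n^{(2k-5)/2}$$

$$= \frac{4^n\, n^{(2k-5)/2}}{2^{k-1}\,(k-1)!\,\sqrt{\pi}}.$$

For $\Psi(n,k)$ we use Fact 6 with $\alpha = \frac{2k-3}{2}$. □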
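For $k = 2$ the estimate can be confronted with an exact value, since $a_2(z) = \frac{1}{2\sqrt{1-4z}} - \frac{1}{2}$ gives $[z^n]a_2(z) = \frac{1}{2}\binom{2n}{n}$ for $n \ge 1$. The following numerical sketch (function names are ours; floating-point arithmetic, hence only approximate) evaluates the estimate with the correction factor obtained from Fact 6 for $\alpha = (2k-3)/2$ and compares it with the exact coefficient.

```python
from math import comb, factorial, pi, sqrt


def psi(n, k):
    """Correction factor given by Fact 6 with alpha = (2k - 3) / 2, up to 1/n^3."""
    a, b, c, d = 2 * k - 3, 2 * k - 5, 2 * k - 7, 2 * k - 9
    return (1
            + a * b / (8 * n)
            + a * b * c * (6 * k - 11) / (384 * n ** 2)
            + a ** 2 * b ** 2 * c * d / (3072 * n ** 3))


def estimate(n, k):
    """Asymptotic estimate of [z^n] a_k(z)."""
    main = 4 ** n * n ** ((2 * k - 5) / 2) / (2 ** (k - 1) * factorial(k - 1) * sqrt(pi))
    return main * psi(n, k)


def exact_a2(n):
    """[z^n] a_2(z) = binom(2n, n) / 2 for n >= 1."""
    return comb(2 * n, n) / 2


for n in (10, 100):
    print(n, estimate(n, 2) / exact_a2(n))
```

The printed ratios approach 1 as $n$ grows, the remaining discrepancy being of order $1/n^4$.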
By looking at Figure 3, we can easily notice a recurring pattern concerning the structure of the functions $a_i$. Therefore, we state the following proposition.

Proposition 9 For every $i > 2$ we have

$$a_i(z) = z^{i-2} \left( \frac{Q_i(z)}{(1-4z)^{i-2}} + \frac{R_i(z)}{(1-4z)^{(2i-3)/2}} \right),$$

where $Q_i$ and $R_i$ are polynomials over $\mathbb{Z}$ in $z$ and $\deg Q_i = \left\lfloor \frac{i-3}{2} \right\rfloor$ and $\deg R_i = \left\lfloor \frac{i-1}{2} \right\rfloor$.

Proof: By induction using formula (†), in the same vein as the proof of Proposition 7. In particular, the first two terms of (†) are derivatives of the generating function of Catalan numbers studied in [9]. □

As we have already mentioned, the number of closed terms of size $n$ is given by $P_n(0)$, which corresponds to the coefficient of $z^n$ in the Taylor expansion of the function $a_{n+2}$. Hence, the sequence of the numbers of closed lambda terms is equal to the sequence $([z^n]\,a_{n+2}(z))_{n \ge 0}$. From Proposition 9, the number of closed terms of size $n$ is equal to $Q_{n+2}(0) + R_{n+2}(0)$. Currently, we have no recursive formula for the $Q_n$'s and the $R_n$'s. However, from Proposition 7 we know that

$$R_{n+2}\!\left(\frac{1}{4}\right) = \frac{C_n}{2^{n+1}}.$$



6 Counting normal forms 

Besides counting terms, one can also count normal forms. To this end, we describe the set of normal forms as follows:

$$\mathcal{G}_m = \mathcal{I}(m) \;\uplus\; \mathcal{G}_m \mathbin{@} \mathcal{F}_m$$

$$\mathcal{F}_m = \lambda\,\mathcal{F}_{m+1} \;\uplus\; \mathcal{G}_m$$

Recall that a normal form is made of a sequence of abstractions on a term which is a variable (a de Bruijn index) applied to a sequence of normal forms. $\mathcal{F}_m$ represents the normal forms and $\mathcal{G}_m$ represents the terms starting with an index. From this we derive the formulas for counting:

$$F_{0,m} = m, \qquad F_{n+1,m} = F_{n,m+1} + G_{n+1,m},$$

$$G_{0,m} = m, \qquad G_{n+1,m} = \sum_{k=0}^{n} G_{n-k,m}\, F_{k,m}.$$

As for terms, we derive polynomials:

$${}^{NF}\!P_0(m) = m, \qquad {}^{NF}\!P_{n+1}(m) = {}^{NF}\!P_n(m+1) + {}^{NF}\!Q_{n+1}(m),$$

$${}^{NF}\!Q_0(m) = m, \qquad {}^{NF}\!Q_{n+1}(m) = \sum_{k=0}^{n} {}^{NF}\!Q_{n-k}(m)\; {}^{NF}\!P_k(m).$$

Lemma 10 For every $n$, the degree of the polynomials ${}^{NF}\!P_n$ and ${}^{NF}\!Q_n$ is equal to $n+1$.

Proof: Like the proof of Lemma 1, by induction on $n$ from the definition of ${}^{NF}\!P_n$ and ${}^{NF}\!Q_n$. □

We have not derived the formulas for the coefficients yet. But these formulas are useful to derive generators of normal forms used in the rest of the paper.
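The two mutually recursive counting formulas for normal forms can be evaluated in the same way as those for plain terms. The sketch below uses the hypothetical names `normal` for $F_{n,m}$ and `neutral` for $G_{n,m}$ (the word "neutral" is our label for terms starting with an index, not the paper's).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def neutral(n, m):
    """G_{n,m}: terms of size n made of an index applied to normal forms."""
    if n == 0:
        return m
    # a term headed by an index, applied to one more normal form of size k
    return sum(neutral(n - 1 - k, m) * normal(k, m) for k in range(n))


@lru_cache(maxsize=None)
def normal(n, m):
    """F_{n,m}: normal forms of size n with at most m free indices."""
    if n == 0:
        return m
    # either an abstraction on a normal form, or a term headed by an index
    return normal(n - 1, m + 1) + neutral(n, m)


print([normal(n, 0) for n in range(4)])
```

For size 3 this gives 11 closed normal forms, which agrees with the list of the 14 closed terms of size 3 given in Section 2: exactly three of them, λ((λ1) 1), λ((λ2) 1) and (λ1) λ1, contain a redex.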

7 Lambda term generation 

From the simple equations defining the number $T_{n,m}$ of terms we can define a function generating them. More precisely, we define a function $T(k, n, m)$ (called Term in Figure 4) which returns the $k$-th term of size $n$ with at most $m$ variables. The integer $k$ belongs to the interval $[1..P_n(m)]$, which requires handling big numbers. This program can be used to enumerate all the λ-terms of size $n$ with at most $m$ distinct free variables. This is appropriate only for small values, since the number of λ-terms is very large. But overall, in order to generate a random term of size $n$ with at most $m$ distinct free variables, it suffices to feed $T$ with a random value $k$ in the interval $[1..P_n(m)]$. Similarly, from the recursive formula for the number of normal forms, one can define a program for their generation.

8 Simply typed terms 

Once we have a random generator for untyped terms, it is easy to build a random generator for simply typed terms. It suffices to sieve the plain terms by a predicate, which we call is-typable. This predicate, which was implemented in SAGE [15] (i.e., in Python) like the rest of the programs, is a classical principal type algorithm [11, 2, 7]. For instance, applying the random generator with parameter 10 (for the size of the term), we got:

λ(λ(((1 λ(1)) λ((3 λ(((1 2) 3)))))))

This is a "typical" simply typed random closed lambda term of size 10 written with de Bruijn indices. Its type is

(((α → ((β → β) → (α → γ) → δ) → θ) → θ) → γ) →
((β → β) → (α → γ) → δ) → δ



Term(k,n,m):
    if n = 0: return k
    elif k ≤ P(n-1)(m+1): return λ Term(k,n-1,m+1)
    else:
        j := 0
        h := k - P(n-1)(m+1)
        while True:
            if h ≤ P(j)(m) * P(n-1-j)(m):
                if h mod P(n-1-j)(m) = 0:
                    return Term(h ÷ P(n-1-j)(m), j, m) @ Term(P(n-1-j)(m), n-1-j, m)
                else:
                    return Term(⌊h ÷ P(n-1-j)(m)⌋ + 1, j, m) @ Term(h mod P(n-1-j)(m), n-1-j, m)
            else:
                h := h - P(j)(m) * P(n-1-j)(m)
                j := j + 1



Figure 4: The program for term generation 
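The unranking program of Figure 4 can be written as runnable Python along the following lines. This is a sketch under our own assumptions, not the authors' SAGE script: terms are encoded as nested tuples, ranks are taken in $[1..P_n(m)]$ with non-strict comparisons, and the two branches on `h mod P(n-1-j)(m)` are merged into a single `divmod` on `h - 1`, which yields the same pair of ranks.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def count(n, m):
    """P_n(m) = T_{n,m}: number of terms of size n with at most m free indices."""
    if n == 0:
        return m
    return count(n - 1, m + 1) + sum(count(i, m) * count(n - 1 - i, m)
                                     for i in range(n))


def term(k, n, m):
    """Return the k-th term (1 <= k <= P_n(m)) of size n with at most m free indices.

    Encoding (hypothetical): an index is an int, an abstraction is
    ("lam", body) and an application is ("app", fun, arg).
    """
    if n == 0:
        return k
    if k <= count(n - 1, m + 1):
        return ("lam", term(k, n - 1, m + 1))
    h = k - count(n - 1, m + 1)
    j = 0
    while True:
        left, right = count(j, m), count(n - 1 - j, m)
        if h <= left * right:
            # rank h inside a block of left * right applications
            q, r = divmod(h - 1, right)
            return ("app", term(q + 1, j, m), term(r + 1, n - 1 - j, m))
        h -= left * right
        j += 1


def random_term(n, m=0, rng=random):
    """Uniformly random term of size n (requires P_n(m) >= 1)."""
    return term(rng.randint(1, count(n, m)), n, m)


for k in range(1, 15):
    print(k, term(k, 3, 0))
```

Enumerating `term(k, 3, 0)` for $k = 1, \ldots, 14$ lists the closed terms of size 3 in the order given in Section 2, starting with λλλ1 and ending with (λ1) λ1.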



We were able to generate terms of size 50 (or of size about 80 in the model in which 
each variable has size 1). For such terms, the generating process is slow, since it 
requires 50 000 generations of terms, each followed by an (unsuccessful) typability test, 
before getting a typed one. But for size 40 (about 65 if one also counts the variables), 
the number of attempts falls to 1000, which is reasonable. However, according to the 
distribution given in Figure 9, if one accepts a bias toward terms starting 
with abstractions, the search is easier (see Section 9.5). 

This kind of random generator is useful for testing functional programs. Michal 
Palka [12, 13] proposed a tool to debug Haskell compilers based on a lambda term 
generator. His generator is based on the development of a typing tree, with choices made 
whenever a new rule is created. Such a method needs to cut branches while developing the 
tree in order to avoid loops. As a result, his generator is not random, which may be a 
drawback in some cases. 



9 Experimental data 

Given a random term generator, we can write programs that gather statistics on some 
features of terms. The experiments recorded here were performed on a laptop with a 
2.4 GHz Intel Core i5 processor. 

9.1 Average depth of variables in terms 

Let us define the depth of a variable as the number of symbols (abstractions and 
applications) between this variable and the top of the term. For instance, given the term 
$\lambda x.(\lambda yz.x)(\lambda u.u)$, the variable $x$ has depth 4, while the depth of $u$ equals 3. In Figure 5 
we plot both the average depth of the variables for 300 throws of random terms of size 15 
up to 175 (scatter plot) and the curve $2n/\ln(n)$ (plain line). We also provide the comparison 
between the average depth of variables in normal forms for 300 throws of normal forms 
of size 15 up to 175 (scatter plot) and the same curve (Figure 6). On this basis, we 
conjecture that the average depth of variables in terms has the asymptotic upper bound 

$$\frac{2n}{\ln(n)}.$$
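The depth statistic is straightforward to compute once terms are available as trees. A possible sketch, assuming terms are encoded as nested tuples with de Bruijn indices (an encoding chosen here for illustration only, as are the function names):

```python
def variable_depths(term, depth=0):
    """Depths of all variable occurrences: the number of abstractions and
    applications on the path from the root of the term to each variable."""
    tag = term[0]
    if tag == 'var':
        return [depth]
    if tag == 'lam':
        return variable_depths(term[1], depth + 1)
    # application: both subterms sit one level deeper
    return variable_depths(term[1], depth + 1) + variable_depths(term[2], depth + 1)

def average_depth(term):
    depths = variable_depths(term)
    return sum(depths) / len(depths)

# The example of the text, λx.(λyz.x)(λu.u), in de Bruijn notation: λ((λλ3)(λ1)).
example = ('lam', ('app', ('lam', ('lam', ('var', 3))), ('lam', ('var', 1))))
```

On `example`, `variable_depths` yields the depths 4 (for $x$) and 3 (for $u$), in agreement with the definition above.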



Figure 5: Average depth of variables and curve $2n/\ln(n)$ 

Figure 6: Average depth of variables in normal forms and curve $2n/\ln(n)$ 



9.2 Average number of head λ's in terms 

We say that $\lambda x$ is a head lambda in a term $t$ if the latter is of the form 
$\lambda x_1 \ldots \lambda x_n \lambda x.s$ for some positive integer $n$ and a certain term $s$. In order to 
know the structure of an average term, we are interested in the average number of head λ's 
occurring in terms. In Figure 7, we compare the average number of head λ's in 400 random 
terms of size 15 to 250 with a reference curve. In Figure 8 we repeat this study in the 
case of random normal forms. Our experiments show that, as far as head λ's are concerned, 
terms and normal forms have approximately the same shape, but the average number of head 
λ's is slightly larger in the case of all terms than in the case of normal forms. 
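Counting head λ's only requires peeling abstractions off the top of a term. A small sketch, again assuming the illustrative nested-tuple encoding `('var', i)`, `('lam', body)`, `('app', fun, arg)`, which is not taken from the authors' programs:

```python
def head_lambdas(term):
    """Number of abstractions at the top of the term, before the first
    application or variable is met."""
    n = 0
    while term[0] == 'lam':
        n += 1
        term = term[1]
    return n

def average_head_lambdas(terms):
    """Average number of head lambdas over a non-empty sample of terms."""
    terms = list(terms)
    return sum(head_lambdas(t) for t in terms) / len(terms)
```

Feeding `average_head_lambdas` with a sample of random terms of a given size gives one point of a scatter plot such as the one of Figure 7.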




Figure 7: Average number of head λ's in terms and a curve involving $\ln(n)$ 

Figure 8: Average number of head λ's in normal forms and curve 



9.3 Ratio of simply typed terms among terms 

It is interesting to investigate the ratio of simply typed terms among untyped ones. 
There are 454 283 lambda terms of size 8, whereas there are 43 977 lambda 
terms of size 7. Therefore, with our implementation, 7 is the upper limit for an 
exhaustive computation of this ratio. The array below gives the ratio of simply typed 
terms over plain terms obtained by an exhaustive examination of the terms up to size 7. 

size                    4        5        6        7 

nb of terms            82      579    4 741   43 977 

nb of typed terms      40      238    1 564   11 807 

ratio              0.4878   0.4110   0.3299   0.2684 

From size 8 on, we computed the ratio by the Monte Carlo method. The results are given in 
Table 1. 

size    8     9     10    11    12    13    14    15    16    20     30     40     45      50 

ratio   .216  .178  .143  .111  .089  .073  .056  .047  .039  .0014  .0012  .0003  .00005  $<10^{-5}$ 

Table 1: Ratio of simply typed terms 

We conclude that simply typed terms become very rare as the size of the terms grows, 
falling below one in 10 000 when the size gets larger than 50. As before, we have 
done the same for normal forms. We got the ratio by an exhaustive examination of 
normal forms up to size 7: 

size                    4        5        6        7 

nb of terms            53      323    2 359   19 877 

nb of typed terms      23      106      587    3 789 

ratio               0.434    0.328    0.249    0.190 

and by the Monte Carlo method thereafter (see Table 2). 

size    8     9     10    11    12    13    14    15    16    20     30     40      45          50 

ratio   .140  .108  .094  .068  .057  .048  .038  .029  .024  .0010  .0009  .00006  $<10^{-5}$  $<10^{-5}$ 

Table 2: Ratio of simply typed normal forms 
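The Monte Carlo step can be sketched generically as follows. The helper below is an illustration under our own assumptions (names and interface are not taken from the authors' programs): it draws ranks uniformly in $[1..P_n(0)]$ and applies a predicate to each rank. In the setting of this section the predicate would build the term of that rank and test its typability. To keep the example self-contained and checkable, it is replaced here by a stand-in test whose exact ratio is known, namely whether the term starts with an abstraction, which under the numbering of Figure 4 holds exactly for the ranks up to $P_{n-1}(1)$.

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n, m):
    """P_n(m): number of terms of size n with de Bruijn indices in [1..m]."""
    if n == 0:
        return m
    return count(n - 1, m + 1) + sum(count(j, m) * count(n - 1 - j, m) for j in range(n))

def monte_carlo_ratio(n, predicate, trials, seed=0):
    """Estimate the fraction of ranks k in [1..P_n(0)] such that predicate(k) holds,
    together with the usual binomial standard error of the estimate."""
    rng = random.Random(seed)
    total = count(n, 0)
    hits = sum(1 for _ in range(trials) if predicate(rng.randint(1, total)))
    p = hits / trials
    return p, (p * (1 - p) / trials) ** 0.5

# Stand-in predicate (NOT the typability test): ranks up to P_{n-1}(1)
# are exactly the closed terms of size n that start with an abstraction.
n = 8
starts_with_lambda = lambda k: k <= count(n - 1, 1)
estimate, stderr = monte_carlo_ratio(n, starts_with_lambda, trials=20000)
exact = count(n - 1, 1) / count(n, 0)
```

The standard error shrinks like the inverse square root of the number of trials, which is why ratios as small as those of the last columns of Tables 1 and 2 need very large samples to be estimated with any precision.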



9.4 Distribution of simply typed lambda terms among terms 

We said that simply typed terms are rare, but we may wonder what rare means exactly. 
More precisely, we may wonder how typed terms are distributed. To answer this question, 
we ran experiments approximating the distribution of the frequency of typed lambda terms 
in segments of the interval $[1..P_n(0)]$. To do so, we divided the interval $[1..P_n(0)]$ 
into segments and computed, on samples of randomly thrown terms, the ratio of simply typed 
terms over general terms we may expect in each segment. Figure 9 is typical of the results 
we got. It corresponds to an experiment on terms of size 25 with 250 segments and tests 
for simple typability on 200 random terms in each segment. It shows that simply typed 
terms are not evenly distributed. They are more concentrated on the left of the interval, 
which corresponds to terms with low numbers. Those terms start with abstractions rather 
than with applications, and this is recursively so for subterms, which gives the impression 
of rolling waves. For instance, there are 2% to 3% of typable terms (of size 25) among 
terms starting with many abstractions, whereas for terms starting with many applications, 
there are large subintervals with almost no typable terms. Figure 10, which gives the same 
statistics for terms of size 30, shows that typed terms become rarer as the size of the 
terms grows. 



Figure 9: Distribution of simply typed lambda terms of size 25. 250 segments on the 
horizontal axis (abstractions on the left, applications on the right), percentage 
(0% - 3%) of typable terms in segments on the vertical axis. 

Figure 10: Distribution of simply typed lambda terms of size 30. 250 segments on the 
horizontal axis (abstractions on the left, applications on the right), percentage 
(0% - 2.5%) of typable terms in segments on the vertical axis. 

Figure 11: Distribution of simply typed normal forms of size 25. 250 segments on the 
horizontal axis, percentage (0% - 6%) of typable normal forms in segments on the vertical 
axis. 

The normal forms are even more scarcely distributed. For comparison, we drew the 
same graphs for normal forms (size of the normal forms: 25 and 30, number of segments: 
250, tests on 200 terms) in Figures 11 and 12. The typable normal forms aggregate more on 
the left of the interval, where terms start mostly with abstractions, with peaks of at 
least 4% per segment. Figure 12 shows that the scarcity of typed normal forms increases as 
the size of terms grows. 
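The segment experiment amounts to stratified sampling of ranks. A minimal sketch follows, with hypothetical helper names and an arbitrary predicate on ranks standing in for "the term with this rank is typable" (the term construction and the typability test themselves are not repeated here):

```python
import random

def segment_bounds(total, segments, s):
    """Inclusive bounds of segment s (0-based) when [1..total] is cut into
    `segments` consecutive parts of nearly equal size (exact integer arithmetic)."""
    return s * total // segments + 1, (s + 1) * total // segments

def segment_frequencies(total, segments, samples, predicate, seed=0):
    """For each segment, the fraction of `samples` uniformly drawn ranks
    that satisfy `predicate`."""
    rng = random.Random(seed)
    freqs = []
    for s in range(segments):
        lo, hi = segment_bounds(total, segments, s)
        if lo > hi:                  # empty segment, possible only if total < segments
            freqs.append(0.0)
            continue
        hits = sum(1 for _ in range(samples) if predicate(rng.randint(lo, hi)))
        freqs.append(hits / samples)
    return freqs
```

With `total` set to $P_{25}(0)$, 250 segments and 200 samples per segment, this mirrors the set-up described for Figure 9.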

9.5 Biased generation 

If we renounce full randomness, the distribution of simply typed terms provides us with a 
clue for getting large typed terms. We propose to call this biased generation. This may 
be convenient if we look for a closed term of large size, not necessarily random. To this 
end, we search for term numbers in the low part of the interval $[1..P_n(0)]$. In 
experiments, we have chosen for instance the first subinterval of size $P_n(0)/2^n$. This 
way we were able to generate large closed simply typed lambda terms of size up to 200. 
Unsurprisingly, the terms we got are clearly not random: a third of their symbols are head 
λ's and the rest are applications. In other words, terms thrown that way have a very 
specific shape, namely a sequence of abstractions followed by a sequence of applications. 

Figure 12: Distribution of simply typed normal forms of size 30. 250 segments on the 
horizontal axis, percentage (0% - 1.45%) of typable normal forms in segments on the 
vertical axis. 
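One way to see why low ranks yield long prefixes of abstractions is to follow the numbering of Figure 4: a rank $k \le P_{n-1}(1)$ denotes an abstraction whose body has the same rank $k$ among the terms of size $n-1$ with one more variable, and so on. The sketch below computes how many head λ's are guaranteed for every rank in the first subinterval. The function names are ours, and the subinterval size is assumed to be the integer quotient $\lfloor P_n(0)/2^n \rfloor$.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n, m):
    """P_n(m): number of terms of size n with de Bruijn indices in [1..m]."""
    if n == 0:
        return m
    return count(n - 1, m + 1) + sum(count(j, m) * count(n - 1 - j, m) for j in range(n))

def biased_bound(n):
    """Size of the first subinterval used for biased generation
    (integer quotient, an assumption of this sketch)."""
    return count(n, 0) // 2 ** n

def guaranteed_head_lambdas(n, bound):
    """Number of head lambdas that every closed term of size n with rank in
    [1..bound] starts with (a lower bound, some of these terms have more)."""
    d = 0
    # A rank <= P_{n-d-1}(d+1) denotes one more abstraction at nesting depth d.
    while d < n and bound <= count(n - d - 1, d + 1):
        d += 1
    return d
```

For instance, for size 4 the first subinterval contains the ranks 1 to 5, and all the corresponding terms start with at least three abstractions.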

11 Conclusion 

The results we obtained open many avenues of research. We have to know more about the 
polynomials $Q_i$ and $R_i$ in Conjecture 9. Here we have considered variables of weight 0, 
because this model is simpler but still challenging and informative. A model with variables 
of weight 1 is worth studying and comparing with the one presented here. This has been 
initiated in [TU] but the generation of random terms has not been considered. Moreover 
we have considered simple types (almost no term is simply typed). In further research we 
plan to focus on other type systems, e.g., system F. But in this case, counting methods 
will not apply, since typing is undecidable [18]. Perhaps like in 0, a method based on 
upper and lower approximations of the numbers of terms may apply. From those results 
we may expect to say something about beta reduction and its average efficiency in the 
untyped case as in the typed case. 

Lemma [2] gives recurrence formulas for the coefficients. We may exploit those results 
to derive a bivariate generating function for the coefficients p l n 's and then a formula or at 
least an asymptotic value of the p" +2, s which are the numbers of closed terms of size n. 
Finally, we studied plain lambda terms, but it seems straightforward to study terms with 
specific types like nat and specific constants like suc : nat → nat. This can be extended 
to languages with variable scopes, not necessarily functional programming languages. 

A Number of terms with exactly m distinct free variables 

In the rest of the paper we were mostly interested in the numbers of terms with at most $m$ 
free variables. Here we study the numbers of terms with exactly $m$ distinct free variables, 
the formulas for counting those numbers, and their relations with the quantities we considered. 

A.1 A formula 

Let us show how to derive the formula for counting λ-terms with exactly $m$ distinct free 
variables. This formula is adapted from a similar one, for the case when variables have 
weight 1, due to Raffalli (On-line Encyclopedia of Integer Sequences under number A135501). 
We assume that terms are built of usual variables (not de Bruijn indices) and that they are 
equivalent up to a renaming of free variables and up to α-conversion. Let us denote the 
number of λ-terms of size $n$ with exactly $m$ distinct free variables by $f_{n,m}$. 

Notice first that there is no term of size 0 with no free variable, hence $f_{0,0} = 0$. There 
is one term of size 0 with one free variable, namely $x$, up to a renaming of the variables, 
hence $f_{0,1} = 1$. The maximum number of variables for a λ-term of size $n$ is reached when 
the only operators are applications and all the variables are different. One has then a 
binary tree with $n$ interior nodes and $n + 1$ leaves holding $n + 1$ variables. This means 
that for $m$ beyond $n + 1$ there is no term of size $n$ with exactly $m$ distinct free variables. 
Hence 

$$f_{n,m} = 0 \quad \text{when } m > n + 1.$$

In the general case, a term of size $n + 1$ with $m$ free variables starts either with an 
abstraction or with an application. Terms starting with an abstraction, say $\lambda x$, on a 
term $M$ contribute in two ways: either $M$ does not contain $x$ as a free variable or $M$ 
contains $x$ as a free variable. There are $f_{n,m}$ such $M$'s in the first case and $f_{n,m+1}$ in the 
second. This gives the first two summands $f_{n,m} + f_{n,m+1}$ of the formula. Let us now look at 
terms starting with an application. Assume they are of the form $N\,P$ and of 
size $n+1$. For some $p \le n$, the term $N$ is of size $p$ and $P$ is of size $n - p$. These two 
terms share $c$ common variables ($0 \le c \le m$), while $N\,P$ has $m$ distinct free variables. 
$N$ has $k$ distinct free variables which do not occur in $P$, hence $N$ has $k + c$ distinct free 
variables altogether. The term $P$ has $m - k$ distinct free variables. Therefore, given a set 
of own variables for $N$, a set of common variables, and a set of own variables for $P$, there 
are $f_{p,k+c}\,f_{n-p,m-k}$ possible pairs $(N, P)$. There are $\binom{m}{c}$ ways to choose the $c$ common 
variables among $m$ and there are $\binom{m-c}{k}$ ways to split the remaining variables between $N$ and 
$P$, namely $k$ for $N$ and $m - c - k$ for $P$, hence the third summand of the formula: 

$$\sum_{p=0}^{n} \sum_{c=0}^{m} \sum_{k=0}^{m-c} \binom{m}{c} \binom{m-c}{k} f_{p,k+c}\, f_{n-p,m-k}.$$

Now, we obtain the whole formula: 

$$f_{n+1,m} = f_{n,m} + f_{n,m+1} + \sum_{p=0}^{n} \sum_{c=0}^{m} \sum_{k=0}^{m-c} \binom{m}{c} \binom{m-c}{k} f_{p,k+c}\, f_{n-p,m-k}.$$
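As a quick sanity check of the recurrence, here are the first values obtained for $n = 0$, worked out by hand from the initial values $f_{0,0} = 0$, $f_{0,1} = 1$ and $f_{0,m} = 0$ for $m > 1$. In each line only the terms of the triple sum that can be nonzero are written.

```latex
\begin{align*}
f_{1,0} &= f_{0,0} + f_{0,1} + f_{0,0}\,f_{0,0} = 0 + 1 + 0 = 1
        && (\lambda x.x) \\
f_{1,1} &= f_{0,1} + f_{0,2} + \binom{1}{1}\binom{0}{0} f_{0,1}\,f_{0,1} = 1 + 0 + 1 = 2
        && (\lambda y.x \text{ and } x\,x) \\
f_{1,2} &= f_{0,2} + f_{0,3} + \binom{2}{0}\binom{2}{1} f_{0,1}\,f_{0,1} = 0 + 0 + 2 = 2
\end{align*}
```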



A.2 Relations between $T_{n,m}$ and $f_{n,m}$ 

The number of terms of size $n$ with exactly $i$ indices in $[1..m]$ is $\binom{m}{i} f_{n,i}$. Therefore the 
number of terms with at most $m$ indices in $[1..m]$ is: 

$$T_{n,m} = \sum_{i=0}^{m} \binom{m}{i} f_{n,i}.$$

By the inversion formula ([6] p. 192), we get: 

$$f_{n,m} = \sum_{i=0}^{m} (-1)^{m-i} \binom{m}{i} T_{n,i}.$$

This shows the unsurprising fact that $f_{n,m}$ and $T_{n,m}$ are simply connected. Since 
the $T_{n,m}$'s can be easily computed, this provides a formula simpler than Raffalli's to 
compute the $f_{n,m}$'s. 
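The two ways of computing $f_{n,m}$ can be compared numerically. The sketch below is ours (function names included): `f` implements the recurrence of Section A.1 with the initial values stated there, `T` is the recurrence for $T_{n,m}$ used in the rest of the paper, and `f_by_inversion` applies the inversion formula above.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def T(n, m):
    """T_{n,m}: number of terms of size n with at most m distinct free variables."""
    if n == 0:
        return m
    return T(n - 1, m + 1) + sum(T(j, m) * T(n - 1 - j, m) for j in range(n))

@lru_cache(maxsize=None)
def f(n, m):
    """f_{n,m}: number of terms of size n with exactly m distinct free variables."""
    if m > n + 1:
        return 0
    if n == 0:
        return 1 if m == 1 else 0
    n -= 1                                       # the recurrence gives f_{n+1,m}
    total = f(n, m) + f(n, m + 1)                # terms starting with an abstraction
    for p in range(n + 1):                       # terms starting with an application
        for c in range(m + 1):
            for k in range(m - c + 1):
                total += comb(m, c) * comb(m - c, k) * f(p, k + c) * f(n - p, m - k)
    return total

def f_by_inversion(n, m):
    """f_{n,m} recovered from the T_{n,i} by binomial inversion."""
    return sum((-1) ** (m - i) * comb(m, i) * T(n, i) for i in range(m + 1))
```

The inversion needs only $m + 1$ values of $T$, whereas the direct recurrence involves a triple sum at every step, which illustrates why the second route is the simpler one.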
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